For each term in Eq. 3.112, we have:
\[
\frac{\partial L_S}{\partial k_n^{l,i}}
= \frac{\partial L_S}{\partial \hat{k}_n^{l,i}}
  \frac{\partial \hat{k}_n^{l,i}}{\partial (w^l \circ k_n^{l,i})}
  \frac{\partial (w^l \circ k_n^{l,i})}{\partial k_n^{l,i}}
= \frac{\partial L_S}{\partial \hat{k}_n^{l,i}}
  \circ \mathbf{1}_{-1 \le w^l \circ k_n^{l,i} \le 1} \circ w^l,
\tag{3.113}
\]
\[
\frac{\partial L_B}{\partial k_n^{l,i}}
= \lambda \Big\{ w^l \circ \big( w^l \circ k_n^{l,i} - \hat{k}_n^{l,i} \big)
+ \nu \big[ (\sigma_i^l)^{-2} \circ \big( k_{i+}^l - \mu_{i+}^l \big)
+ (\sigma_i^l)^{-2} \circ \big( k_{i-}^l + \mu_{i-}^l \big) \big] \Big\},
\tag{3.114}
\]
where $\mathbf{1}$ is the indicator function, which is widely used to estimate the gradient of nondifferentiable parameters [199], and $(\sigma_i^l)^{-2}$ denotes a vector whose elements all equal the scalar $(\sigma_i^l)^{-2}$.
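To make the two kernel gradients concrete, the following is a minimal NumPy sketch of Eqs. 3.113 and 3.114, not a reference implementation. The function and argument names (`dLS_dkhat`, `lam`, `nu`, etc.) are illustrative assumptions, $\mu_i^l$ and $\sigma_i^l$ are treated as scalars shared per kernel as stated in Section 3.2, and the positive/negative parts of the kernel are assumed to be pulled toward $+\mu_i^l$ and $-\mu_i^l$, matching the sign pattern of Eq. 3.118.

```python
import numpy as np

def kernel_gradients(dLS_dkhat, w, k, k_hat, mu, sigma, lam, nu):
    """Illustrative sketch of Eqs. 3.113-3.114 for one kernel k_n^{l,i}.

    dLS_dkhat, w, k, k_hat share the kernel's shape; mu and sigma are the
    scalar mean/std shared by the kernel; lam and nu are hyperparameters.
    """
    # Eq. 3.113: the indicator 1_{-1 <= w∘k <= 1} acts as a straight-through
    # estimator, zeroing the gradient wherever w∘k saturates outside [-1, 1].
    indicator = (np.abs(w * k) <= 1.0).astype(k.dtype)
    dLS_dk = dLS_dkhat * indicator * w

    # Eq. 3.114: reconstruction term plus the Gaussian-prior term; positive
    # entries of k are drawn toward +mu, negative entries toward -mu.
    prior = np.where(k >= 0, (k - mu) / sigma**2, (k + mu) / sigma**2)
    dLB_dk = lam * (w * (w * k - k_hat) + nu * prior)
    return dLS_dk, dLB_dk
```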
Updating $w^l$: Unlike the forward process, $w^l$ is used in backpropagation to calculate the gradients. This process is similar to the way $\hat{x}$ is calculated from $x$ asynchronously. Specifically, $\delta_{w^l}$ is composed of the following two parts:
\[
\delta_{w^l} = \frac{\partial L}{\partial w^l}
= \frac{\partial L_S}{\partial w^l} + \frac{\partial L_B}{\partial w^l}.
\tag{3.115}
\]
For each term in Eq. 3.115, we have:
\[
\frac{\partial L_S}{\partial w^l}
= \sum_{i=1}^{I_l} \sum_{n=1}^{N_{I_l}}
  \frac{\partial L_S}{\partial \hat{k}_n^{l,i}}
  \frac{\partial \hat{k}_n^{l,i}}{\partial (w^l \circ k_n^{l,i})}
  \frac{\partial (w^l \circ k_n^{l,i})}{\partial w^l}
= \sum_{i=1}^{I_l} \sum_{n=1}^{N_{I_l}}
  \frac{\partial L_S}{\partial \hat{k}_n^{l,i}}
  \circ \mathbf{1}_{-1 \le w^l \circ k_n^{l,i} \le 1} \circ k_n^{l,i},
\tag{3.116}
\]
\[
\frac{\partial L_B}{\partial w^l}
= \lambda \sum_{i=1}^{I_l} \sum_{n=1}^{N_{I_l}}
  \big( w^l \circ k_n^{l,i} - \hat{k}_n^{l,i} \big) \circ k_n^{l,i}.
\tag{3.117}
\]
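A minimal sketch of the update of $w^l$ following Eqs. 3.115-3.117 is given below, assuming the per-kernel quantities are available as Python sequences; the function and argument names are illustrative, not part of the original formulation.

```python
import numpy as np

def weight_gradient(dLS_dkhat_list, w, kernels, khat_list, lam):
    """Sketch of Eqs. 3.115-3.117: gradient w.r.t. the shared weight w^l.

    dLS_dkhat_list, kernels, khat_list each hold one array per kernel
    k_n^{l,i}; w has the same shape as each kernel.
    """
    dLS_dw = np.zeros_like(w)
    dLB_dw = np.zeros_like(w)
    for dLS_dkhat, k, k_hat in zip(dLS_dkhat_list, kernels, khat_list):
        # Eq. 3.116: same straight-through indicator as in Eq. 3.113, but the
        # chain rule now ends at w^l, so the last factor is k instead of w.
        indicator = (np.abs(w * k) <= 1.0).astype(k.dtype)
        dLS_dw += dLS_dkhat * indicator * k
        # Eq. 3.117: reconstruction term accumulated over all kernels.
        dLB_dw += (w * k - k_hat) * k
    # Eq. 3.115: the two parts sum to the total gradient used to update w^l.
    return dLS_dw + lam * dLB_dw
```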
Updating $\mu_i^l$ and $\sigma_i^l$: Note that we use the same $\mu_i^l$ and $\sigma_i^l$ for each kernel (see Section 3.2). So, the gradients here are scalars. The gradients $\delta_{\mu_i^l}$ and $\delta_{\sigma_i^l}$ are calculated as:
\[
\delta_{\mu_i^l} = \frac{\partial L}{\partial \mu_i^l} = \frac{\partial L_B}{\partial \mu_i^l}
= \frac{\lambda \nu}{C_i^l \times H^l \times W^l}
  \sum_{n=1}^{C_i^l} \sum_{p=1}^{H^l \times W^l}
\begin{cases}
(\sigma_i^l)^{-2}\big(\mu_i^l - k_{n,p}^{l,i}\big), & k_{n,p}^{l,i} \ge 0,\\[2pt]
(\sigma_i^l)^{-2}\big(\mu_i^l + k_{n,p}^{l,i}\big), & k_{n,p}^{l,i} < 0,
\end{cases}
\tag{3.118}
\]
\[
\delta_{\sigma_i^l} = \frac{\partial L}{\partial \sigma_i^l} = \frac{\partial L_B}{\partial \sigma_i^l}
= \frac{\lambda \nu}{C_i^l \times H^l \times W^l}
  \sum_{n=1}^{C_i^l} \sum_{p=1}^{H^l \times W^l}
\begin{cases}
-(\sigma_i^l)^{-3}\big(k_{n,p}^{l,i} - \mu_i^l\big)^2 + (\sigma_i^l)^{-1}, & k_{n,p}^{l,i} \ge 0,\\[2pt]
-(\sigma_i^l)^{-3}\big(k_{n,p}^{l,i} + \mu_i^l\big)^2 + (\sigma_i^l)^{-1}, & k_{n,p}^{l,i} < 0,
\end{cases}
\tag{3.119}
\]
where $k_{n,p}^{l,i}$, $p \in \{1, \ldots, H^l \times W^l\}$, denotes the $p$-th element of $k_n^{l,i}$. In the fine-tuning process, we update $c_m$ using the same strategy as the center loss [245]. The update of $\sigma_{m,n}$ based on $L_B$ is straightforward and is not elaborated here for brevity.
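As an illustration of Eqs. 3.118 and 3.119, the following is a minimal NumPy sketch of the scalar gradients for $\mu_i^l$ and $\sigma_i^l$, assuming the $C_i^l$ kernels that share these parameters are stacked into one array; the function and argument names are illustrative assumptions.

```python
import numpy as np

def mean_std_gradients(kernels, mu, sigma, lam, nu):
    """Sketch of Eqs. 3.118-3.119: scalar gradients for mu_i^l and sigma_i^l.

    `kernels` stacks the C_i^l kernels of shape (H^l, W^l) that share the
    scalar mean mu and std sigma; lam and nu are hyperparameters.
    """
    k = np.asarray(kernels)            # shape (C_i^l, H^l, W^l)
    count = k.size                     # C_i^l * H^l * W^l
    pos = (k >= 0)

    # Eq. 3.118: positive entries use (mu - k), negative entries use (mu + k).
    d_mu = np.where(pos, (mu - k) / sigma**2, (mu + k) / sigma**2)
    delta_mu = lam * nu * d_mu.sum() / count

    # Eq. 3.119: -(sigma)^{-3}(k ∓ mu)^2 + (sigma)^{-1}, split by the sign of k.
    dev = np.where(pos, k - mu, k + mu)
    d_sigma = -(dev**2) / sigma**3 + 1.0 / sigma
    delta_sigma = lam * nu * d_sigma.sum() / count
    return delta_mu, delta_sigma
```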